Parallel processing method for inverse matrix for shared memory type scalar parallel computer

ABSTRACT

An LU decomposition is carried out on blocks E and H. Then, a block B is updated using the upper triangular portion of the block E, and a block D is updated using the lower triangular portion of the block E. Blocks F and I have already been updated during the LU decomposition. Then, using the blocks B, D, F, and H, blocks A, C, G, and I are updated, the upper triangular portion of the block E is updated, and the blocks D and F are updated. Then, the second updating process is performed on the block E. Using the result of that process, the blocks B and H are updated. Finally, the block E is updated and the pivot interchanging process is completed, thereby terminating the process. These processes on the blocks are performed in parallel by a plurality of threads.

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The present invention relates to an arithmetic process using a shared memory type scalar parallel computer.

[0003] 2. Description of the Related Art

[0004] Conventionally, when an inverse matrix of a matrix is obtained using a vector computer, a computing method based on the Gauss-Jordan method is adopted, and the operations are performed quickly by utilizing fast-access memory. For example, methods such as double unrolling, in which the instructions in a loop are unrolled and executed together, are used.

[0005] Described below is the method of obtaining an inverse matrix by the Gauss-Jordan method (or referred to simply as the Gauss method). (In the explanation below, the interchange of pivots is omitted, but a process of interchanging row vectors for the interchange of pivots is actually performed.)

[0006] Assume that A indicates a matrix for which an inverse matrix should be calculated, and x and y indicate arbitrary column vectors. Ax=y is expressed as the following simultaneous linear equations when the matrix elements are explicitly written.

a₁₁x₁ + a₁₂x₂ + ··· + a₁ₙxₙ = y₁

a₂₁x₁ + a₂₂x₂ + ··· + a₂ₙxₙ = y₂

. . .

aₙ₁x₁ + aₙ₂x₂ + ··· + aₙₙxₙ = yₙ

[0007] If the above equations are transformed into By=x, then B is the inverse matrix of A. The transformation proceeds as follows.

[0008] 1) The equation in the first row is divided by a₁₁.

[0009] 2) From the i-th row (i>1), subtract the first row multiplied by aᵢ₁.

[0010] 3) To obtain a coefficient of 1 for x₂ in the equation in the second row, multiply the second row by the reciprocal of the coefficient of x₂.

[0011] 4) From the i-th row (i>2), subtract the second row multiplied by aᵢ₂.

[0012] 5) The above operation is continued up to the (n−1)th row, as sketched in code below.
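
A minimal Python sketch of this procedure is given below. It is an illustration, not the embodiment described later: pivoting is omitted, as in the explanation above, and the elimination is applied to all rows i≠k (the full Gauss-Jordan form, consistent with the later remark that left elements are also deleted).

    import numpy as np

    def gauss_jordan_inverse(a):
        # Invert a square matrix by the Gauss-Jordan method of
        # steps 1) through 5). Pivoting is omitted, so every
        # diagonal element encountered must be nonzero.
        a = np.array(a, dtype=float)
        n = a.shape[0]
        b = np.eye(n)                    # accumulates the inverse
        for k in range(n):
            p = a[k, k]
            a[k, :] /= p                 # steps 1) and 3): scale row k
            b[k, :] /= p
            for i in range(n):
                if i != k:               # steps 2) and 4): eliminate column k
                    f = a[i, k]
                    a[i, :] -= f * a[k, :]
                    b[i, :] -= f * b[k, :]
        return b

For example, gauss_jordan_inverse([[2.0, 1.0], [1.0, 1.0]]) returns [[1.0, -1.0], [-1.0, 2.0]].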

[0013] The interchange of column vectors accompanying the interchange of pivots is described below.

[0014] Both sides of Ax=y are multiplied by the matrix P corresponding to the interchange of pivots.

PAx=Py=z

[0015] When the following equation holds with the matrix B,

x=Bz

[0016] then B is expressed as follows.

B=(PA)⁻¹=A⁻¹P⁻¹

[0017] That is, by right-multiplying the obtained B by P, an inverse matrix of A can be obtained. In practice, this is realized by interchanging the column vectors.

[0018] In the equation, P=PₙPₙ₋₁ . . . P₁, and each Pₖ is an orthogonal transform whose elements differ from the unit matrix only in P_(ii)=0, P_(ij)=1, P_(jj)=0, and P_(ji)=1, interchanging rows i and j.
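
A minimal sketch of this final interchange in Python, assuming a hypothetical array piv in which piv[k] records the index of the row interchanged with row k (the embodiment stores this history in the array ip(n) described later):

    import numpy as np

    def apply_inverse_pivots(b, piv):
        # Right-multiplying B by P undoes the pivoting; in terms of
        # data movement this interchanges columns in the reverse
        # order of the recorded interchanges.
        for k in reversed(range(len(piv))):
            if piv[k] != k:
                b[:, [k, piv[k]]] = b[:, [piv[k], k]]
        return b

Since B·P = B·Pₙ···P₁ applies Pₙ first as a column operation, the interchanges are undone in the reverse order of the recorded history.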

[0019] In the vector computer, an inverse matrix is computed by the above mentioned method based on the assumption that memory access operations can be performed quickly. In the case of the shared memory type scalar computer, however, the frequency of accessing the shared memory increases with the size of the matrix to be computed, which severely limits performance. It is therefore necessary to perform the above mentioned matrix computation by utilizing the fast cache memory provided for each processor of the shared memory type scalar computer. That is, since the shared memory is accessed frequently if the computation is performed on each row or column of the matrix, an algorithm is required that localizes the computation assigned to each processor by dividing the matrix into blocks, so that each processor processes the largest possible amount of data in its cache memory before accessing the shared memory, thereby reducing the frequency of shared memory accesses.

SUMMARY OF THE INVENTION

[0020] The present invention aims at providing a method of computing an inverse matrix at a high speed in a parallel process.

[0021] The method according to the present invention is a parallel processing method for an inverse matrix for a shared memory type scalar parallel computer, and includes the steps of: specifying a predetermined square block in a matrix for which an inverse matrix is to be obtained; decomposing the matrix into upper left, left side, lower left, upper, lower, upper right, right side, and lower right blocks surrounding the square block positioned in the center; dividing each of the decomposed blocks by the number of processors and LU-decomposing the square block and the lower, right side, and lower right blocks in parallel; updating the left side, upper, lower, and right side blocks in parallel in a recursive program, and further updating in parallel the upper left, lower left, upper right, and lower right blocks using the blocks updated in the recursive program; updating the predetermined square block in plural stages using one processor; and setting the position of the square block such that it sequentially moves along the diagonal line of the matrix, and obtaining an inverse matrix of the matrix by repeating the above mentioned steps.

[0022] According to the present invention, the arithmetic operation for an inverse matrix is organized as an updating process performed on each block, and the blocks are updated in parallel by a plurality of processors. Thus, the shared memory is accessed only after the largest possible amount of operations has been performed on a block stored in the cache provided for each processor, thereby realizing a localized algorithm and a high-speed arithmetic operation for an inverse matrix.

BRIEF DESCRIPTION OF THE DRAWINGS

[0023] FIG. 1 shows the configuration of the hardware of the shared memory type scalar computer according to an embodiment of the present invention;

[0024] FIG. 2 is a diagram (1) showing the order of computation according to an embodiment of the present invention;

[0025] FIG. 3 is a diagram (2) showing the order of computation according to an embodiment of the present invention;

[0026] FIG. 4 is a diagram (3) showing the order of computation according to an embodiment of the present invention;

[0027] FIG. 5 is a diagram (4) showing the order of computation according to an embodiment of the present invention;

[0028] FIG. 6 is a diagram (5) showing the order of computation according to an embodiment of the present invention;

[0029] FIGS. 7A through 7F are diagrams (6) showing the order of computation according to an embodiment of the present invention;

[0030] FIGS. 8A through 8C are diagrams (7) showing the order of computation according to an embodiment of the present invention;

[0031] FIG. 9 shows a pseudo code (1) according to an embodiment of the present invention;

[0032] FIG. 10 shows a pseudo code (2) according to an embodiment of the present invention;

[0033] FIG. 11 shows a pseudo code (3) according to an embodiment of the present invention;

[0034] FIG. 12 shows a pseudo code (4) according to an embodiment of the present invention;

[0035] FIG. 13 shows a pseudo code (5) according to an embodiment of the present invention;

[0036] FIG. 14 shows a pseudo code (6) according to an embodiment of the present invention; and

[0037] FIG. 15 shows a pseudo code (7) according to an embodiment of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0038] The present embodiment provides an algorithm for obtaining an inverse matrix of a given matrix. In the shared memory type scalar computer, it is necessary to increase the ratio of computation to load. Therefore, according to an embodiment of the present invention, a method of dividing a matrix into blocks and expressing the work as matrix products is adopted so that the arithmetic operations can be performed efficiently. Furthermore, the computation density is enhanced by updating part of the matrix with a recursively calculated product of matrices of varying dimensions.

[0039] FIG. 1 shows the configuration of the hardware of the shared memory type scalar computer according to an embodiment of the present invention.

[0040] Processors 10-1 through 10-n have primary cache memory; the primary cache memory can also be incorporated into the processor. Each of the processors 10-1 through 10-n is provided with secondary cache memory 13-1 through 13-n, and the secondary cache memory 13-1 through 13-n is connected to an interconnection network 12. The interconnection network 12 is provided with memory modules 11-1 through 11-n, which form the shared memory. The processors 10-1 through 10-n read the data necessary for arithmetic operations from the memory modules 11-1 through 11-n, store the data in the secondary cache memory 13-1 through 13-n or the primary cache memory through the interconnection network 12, and perform the arithmetic operations.

[0041] In this case, reading data from the memory modules 11-1 through 11-n into the secondary cache memory 13-1 through 13-n or the primary cache memory, and writing computed data from the secondary cache memory 13-1 through 13-n or the primary cache memory back to the memory modules 11-1 through 11-n, are performed much more slowly than the operations of the processors 10-1 through 10-n. Therefore, if these reading and writing operations are performed frequently, the performance of the entire computer is severely degraded.

[0042] Therefore, to maintain high performance of the entire computer, an algorithm is required that reduces the accesses to the memory modules 11-1 through 11-n and performs the largest possible amount of computation in the local system formed by the secondary cache memory 13-1 through 13-n, the primary cache memory, and the processors 10-1 through 10-n.

[0043] Therefore, according to an embodiment of the present invention, the computation of an inverse matrix is performed as follows.

[0044] Only a rightward updating process is performed on a block having a certain block width in the matrix to be computed, so that the information used in the updating process outside the block can be stored. That is, in the Gauss-Jordan method described above as the conventional technology, the remaining portion is updated by the outer product of the column vector selected as the center and the row vector orthogonal to it. Explicitly, step 2) of the above mentioned Gauss-Jordan method is written as follows.

aᵢⱼ ← aᵢⱼ − c₁ⱼ × aᵢ₁ (where i, j>1 and c₁ⱼ = a₁ⱼ/a₁₁)

[0045] This equation means that, by selecting the column vector in the first column as the center and selecting the row vector in the first row orthogonal to the selected vector, the matrix formed by the second and subsequent rows and the second and subsequent columns is updated by the outer product of the column vector in the first column and the row vector in the first row.
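
Written as code, this step is a rank-1 (outer product) update of the trailing submatrix; the following minimal sketch assumes the matrix is held in a NumPy array:

    import numpy as np

    def rank1_step(a):
        # One elimination step written as an outer product: the
        # trailing submatrix is updated using column 1 and the
        # scaled row 1, exactly as in the equation above.
        c = a[0, 1:] / a[0, 0]                # c_1j = a_1j / a_11
        a[1:, 1:] -= np.outer(a[1:, 0], c)    # a_ij <- a_ij - c_1j * a_i1
        return a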

[0046] Furthermore, Ax=y can be rewritten using the unit matrix I as Ax=Iy, and when A is transformed by the Gauss-Jordan method, the unit matrix on the right side is transformed as well.

[0047] In updating the unit matrix, the element (i, i) of the column corresponding to the coefficient deleted in A becomes 1/aᵢᵢ in the right matrix, and the other elements of that column in the right matrix are obtained by multiplying the elements of the column vector deleted in A by 1/aᵢᵢ and then by −1. In a column block matrix, the left elements are also to be deleted.

[0048] FIGS. 2 through 8 show the order of computation according to an embodiment of the present invention.

[0049] The order of computation is described below.

[0050] First, as shown in FIG. 2, the matrix M is divided into column block matrices from the left side, and the updating process is performed by the above mentioned method.

[0051] 1) E and H are LU decomposed (refer to Japanese Patent Application No. Hei-12-358232).

[0052] 2) B is updated by B←BU⁻¹ using the upper triangular portion U of E.

[0053] 3) D←L⁻¹D and F←L⁻¹F using the lower triangular portion L of E.

[0054] 4) A, C, G, and I are updated.

A←A−B×D, C←C−B×F, G←G−H×D, I←I−H×F

[0055] 5) The upper triangular portion of E is updated. Since this portion has not been updated in the LU decomposition, it is updated at this timing.

[0056] 6) The upper portions of D and F are updated according to the information about D, E, and F.

[0057] At this time, the updating process is performed using matrix products in a recursive program to enhance the computation density.

[0058] a) Recursive Program

[0059] The upper portions of D and F can be computed by the outer products of the respective row vectors and the column vector orthogonal to the row vector of E. To enhance the density of computation, the following recursive program is used. The upper portions of D and F indicate, as described by referring to FIG. 8, the upper portions of the row blocks of D and F. As clearly shown in FIG. 8, D and F can be updated from the upper portions of the row blocks.

[0060] recursive subroutine rowupdate ( )
      if (update width < 10) then
c       The product of the matrix excluding the diagonal elements and
c       the row block matrix is subtracted from the upper triangular
c       matrix of E (refer to the explanation of FIG. 8 for details).
      else
c       The update width is divided into a first half and a second half.
c       The first half is updated by calling rowupdate.
        call rowupdate ( )
c       Then, the second half is updated after calling rowupdate again.
        call rowupdate ( )
      end if
      return
      end
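
A hedged sketch of this recursion in Python follows. It assumes a unit lower triangular factor L taken from the LU decomposition of E and the panel D to be transformed; the names and the threshold of 10 follow the pseudo code, but the exact data layout of the embodiment may differ.

    import numpy as np

    def rowupdate(L, D):
        # Transform D <- L^{-1} D for unit lower triangular L by
        # halving the update width, so that most of the work is
        # performed as one matrix product of large blocks.
        n = L.shape[0]
        if n < 10:
            for i in range(1, n):             # base case: forward substitution
                D[i, :] -= L[i, :i] @ D[:i, :]
        else:
            h = n // 2
            rowupdate(L[:h, :h], D[:h, :])    # update the first half
            D[h:, :] -= L[h:, :h] @ D[:h, :]  # subtract the matrix product
            rowupdate(L[h:, h:], D[h:, :])    # then update the second half
        return D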

[0070] 7) Then, while moving along the elements on the diagonal line in the lower right direction, the rectangular portion in the column direction on the lower left with respect to the diagonal line is updated. As a result, the information required to update B and H is generated.

[0071] 8) The portions on the right of B and H are updated. This process is performed as follows.

[0072] With d indicating the block width, the numbers i, . . . , i+d−1 are sequentially assigned from left to right.

[0073] The reciprocal of a_(ii) is obtained. The other portions are obtained by dividing each portion by a_(ii) and inverting the sign.

[0074] The portion on the left of the (i+1)th row is divided by a_(i+1,i+1), and the reciprocal of a_(i+1,i+1) is obtained.

[0075] The left portion is updated by multiplying the row i+1 by the column i+1 of the left portion, and then performing the subtraction.

[0076] The above mentioned processes are repeated.

[0077] This portion is further divided into the following procedures. They also require rearrangement into a recursive algorithm.

[0078] 9) Finally, the rectangular portion in the row direction on the upper left of the elements on the diagonal line of E is updated.

[0079] 10) Finally, the column vectors are interchanged in the reverse direction of the history of the pivot interchanges.

[0080] The portion updated in 5) above is updated by E2=E2−a×b while sliding the diagonal elements of the diagonal-line portion (E2) shown in FIG. 3 along the diagonal line.

[0081] This portion has not been updated in the LU decomposition. After the update, the information about the upper triangular matrix required to update D and F is obtained.

[0082] The portion updated in 7) above is updated by E3=E3−c×d while sliding the diagonal elements of the diagonal-line portion shown in FIG. 4.

[0083] Prior to the update, c is multiplied by the reciprocal of the diagonal element.

[0084] The diagonal element after the update is considered to be the reciprocal of the original diagonal element. The quantity d is considered to be obtained by multiplying each column element by the diagonal element before the update, and then multiplying the result by −1.

[0085] As a result, the information about the lower triangular matrix required to update the portion on the left of B and H is obtained.

[0086] In 9) above, the updating process is performed by E1=E1−a×c while sliding the diagonal elements of the diagonal-line portion shown in FIG. 5.

[0087] The 'a' after the update is considered to be obtained by multiplying the column elements of a by the original diagonal element, and then by −1.
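
Read together, the staged updates 5), 7), and 9) transform the pivot block E in place. The following compact sketch shows their plausible combined effect as a single in-place Gauss-Jordan sweep; the embodiment interleaves the three stages with the updates of the surrounding blocks, and pivoting inside the block is omitted here.

    import numpy as np

    def invert_block_inplace(e):
        # In-place Gauss-Jordan sweep over the pivot block: after the
        # loop, e holds the inverse of the original block.
        n = e.shape[0]
        for k in range(n):
            p = e[k, k]
            e[k, k] = 1.0
            e[k, :] /= p                     # row k scaled by 1/p
            for i in range(n):
                if i != k:
                    f = e[i, k]
                    e[i, k] = 0.0
                    e[i, :] -= f * e[k, :]   # column k ends up as -f/p
        return e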

[0088] D) Details of the parallelism for the shared memory type scalar computer

[0089] 0) The LU decomposition on E, F, H, and I is a parallel decomposition using the parallel algorithm of the LU decomposition.

[0090] 1) Control of block width

[0091] An embodiment of the present invention is provided with a scheme for adjusting the block width depending on the scale of the problem and the number of processors used in the parallelism.

[0092] Since the block E is updated by one processor, the block width is determined such that the total cost of this portion is negligible (about 1%) of the entire computation.

[0093] 2) In the update in C 4) above, the updating processes by the respective matrix products are performed in parallel. The second dimension is divided equally among the processors, which share the computation in parallel (see the sketch after this list).

[0094] 3) The update in C 5), that is, the update of D and F, is performed by each processor updating in parallel the areas obtained by equally dividing the second dimension of D and F.

[0095] 4) The update in C 8) above is performed in parallel by the processors sharing the areas obtained by equally dividing the first dimension of H.

[0096] 5) Finally, the interchanging process on the column vectors is performed in parallel by equally dividing the first dimension of the entire matrix into areas.
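
A minimal sketch of the equal division used in 2) through 5) above (the helper name is hypothetical): given a dimension of size n and nthr threads, thread tid handles the half-open range returned below.

    def my_share(n, nthr, tid):
        # Divide a dimension of size n as evenly as possible among
        # nthr threads; the first n % nthr threads get one extra item.
        base, rem = divmod(n, nthr)
        start = tid * base + min(tid, rem)
        return start, start + base + (1 if tid < rem else 0)

For n=10 and nthr=4, the threads receive the ranges (0, 3), (3, 6), (6, 8), and (8, 10).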

[0097] The broken lines in FIG. 6 show an example of dividing an area when a matrix portion is updated in parallel.

[0098] Access to Memory in Updating B or H

[0099] Described below is the case in which the recursive program operates up to a depth of 2.

[0100] FIGS. 7A and 7B, 7C and 7D, and 7E and 7F each form a pair.

[0101] First updated is the diagonal-line portion shown in FIG. 7A. At this time, the diagonal-line portion and the dotted-line triangular portion are used.

[0102] Refer to the pseudo code for details. Next, the diagonal-line portion shown in FIG. 7A is updated using the horizontal-line portion shown in FIG. 7A and the rectangular diagonal-line portion shown in FIG. 7B. Then, the horizontal-line portion shown in FIG. 7A is updated using the triangular portion in bold lines.

[0103] Next, the diagonal-line portion shown in FIG. 7C is updated by the product of the right white-painted rectangle shown in FIG. 7C and the rectangle in bold lines shown in FIG. 7D.

[0104] Then, the diagonal-line portion shown in FIG. 7E is updated using the triangular portion in broken lines shown in FIG. 7F. Furthermore, the diagonal-line portion shown in FIG. 7E is updated by the product of the horizontal-line portion shown in FIG. 7E and the rectangle with diagonal lines shown in FIG. 7F. Finally, the horizontal-line portion is updated using the triangular portion in bold lines shown in FIG. 7F.

[0105] In the above mentioned procedure, the references to E are common. Therefore, E can be stored in the cache of each processor for reference. In addition, the references to and updates of B are processed in parallel on the areas in the row direction (the areas obtained by the division with bold broken lines).

[0106] Memory Access in Updating D or F

[0107] Described below is the case in which the recursive program operates up to a depth of 2.

[0108] First updated is the left diagonal-line portion shown in FIG. 8A. At this time, the diagonal-line portion and the right triangular portion in broken lines shown in FIG. 8A are used.

[0109] Refer to the pseudo code for details. Then, the left diagonal-line portion shown in FIG. 8A is updated using the right rectangular portion with diagonal lines shown in FIG. 8A and the left vertical-line portion shown in FIG. 8A. Then, the left horizontal-line portion shown in FIG. 8A is updated using the right triangular portion in bold lines shown in FIG. 8A.

[0110] Next, the left diagonal-line portion shown in FIG. 8B is updated by the product of the right rectangle in bold lines shown in FIG. 8B and the left white-painted rectangle shown in FIG. 8B.

[0111] Then, the left diagonal-line portion shown in FIG. 8C is updated using the right triangular portion in broken lines shown in FIG. 8C. The left diagonal-line portion shown in FIG. 8C is updated using the product of the right rectangle with diagonal lines shown in FIG. 8C and the left vertical-line portion shown in FIG. 8C. Finally, the left vertical-line portion shown in FIG. 8C is updated using the right triangular portion in bold lines shown in FIG. 8C.

[0112] In the above mentioned procedure, the references to E are common. Therefore, E can be stored in the cache of each processor for reference. Furthermore, the references to and updates of D are processed in parallel on the areas in the row direction (the areas obtained by the division with bold broken lines).

[0113] FIGS. 9 through 15 show the pseudo code according to an embodiment of the present invention.

[0114] FIG. 9 shows the pseudo code of the main algorithm of the parallel inverse matrix computation. In the following pseudo code, a row having C at the left end is a comment row. 'array a(k,n)' is an array storing the elements of the matrix whose inverse matrix is to be obtained. 'ip(n)' is an array storing the information used in interchanging rows in the LU decomposition subroutine. For the algorithm of the LU decomposition subroutine, refer to Japanese Patent Application Laid-open No. Hei-12-358232. 'nb' refers to the number of blocks specified when the LU decomposition is performed.

[0115] When the LU decomposition is completed on one specified block, and if ip(i) is larger than i, then the i-th row of the matrix is interchanged with the ip(i)-th row. Then, the update subroutine is called, and the matrix is updated. The processes from the LU decomposition to the update subroutine are repeated until they have been performed on all specified blocks. For the final specified block, another LU decomposition and update are performed, thereby terminating the process.
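
As a single-threaded illustration of this overall flow, the following hedged sketch performs a blocked inversion in place; pivoting and the thread-level division of the matrix products are omitted, and the function and parameter names are hypothetical.

    import numpy as np

    def blocked_inverse(a, blk=32):
        # For each diagonal block: invert it, transform the blocks in
        # its row and column, and update the remaining blocks by
        # matrix products -- the serial skeleton of FIGS. 9 and 10.
        a = np.array(a, dtype=float)
        n = a.shape[0]
        for j0 in range(0, n, blk):
            j1 = min(j0 + blk, n)
            J = np.arange(j0, j1)
            R = np.r_[0:j0, j1:n]                    # indices outside the block
            W = np.linalg.inv(a[np.ix_(J, J)])       # pivot block E
            a[np.ix_(J, J)] = W
            a[np.ix_(J, R)] = W @ a[np.ix_(J, R)]    # row block (D, F side)
            Cb = a[np.ix_(R, J)].copy()              # column block (B, H side)
            a[np.ix_(R, R)] -= Cb @ a[np.ix_(J, R)]  # corners (A, C, G, I): GEMM
            a[np.ix_(R, J)] = -Cb @ W
        return a

Under this serial reading, the inverse accumulates in place; with blk=1 the loop reduces to the unblocked Gauss-Jordan sweep shown earlier.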

[0116] FIG. 10 shows the pseudo code for updating the remaining portions according to the information about the LU decomposition.

[0117] In the update routine shown in FIG. 10, the updating process is performed on each of the blocks A through H. For the blocks A through D and G, a dedicated subroutine is further called. Since the block I has already been updated when the LU decomposition is carried out, no separate subroutine is prepared for it here.

[0118] When the update routine terminates for the blocks A through D and G, barrier synchronization is attained. Then, if the number of the processor (the thread number) is 1, the 'update 1 of e' is performed. This process is performed by the e-update1 subroutine. After the process, barrier synchronization is attained.

[0119] 'len' indicates the width of the block to be processed in one thread. 'is' indicates the first position of the block to be processed, and 'ie' is the last position of the block to be processed. 'df-update' indicates the subroutine for updating the blocks D and F. When the blocks D and F have been updated, the first position of the block with the block width added is stored as the first position (nbase2) of a new block, 'len' is computed anew, the first and last positions 'is2' and 'ie2' of the blocks are computed anew, and D and F are updated by df-update, thereby attaining barrier synchronization.

[0120] Additionally, as for 'update 2 of e', when the thread number is 1, the update subroutine e-update2 of the block E is called, and barrier synchronization is performed. Similarly, as described above, 'len', 'is', and 'ie' are computed, the update routine bh-update is called for the blocks B and H, 'nbase2' is obtained, 'len', 'is2', and 'ie2' are obtained, and the process is performed by bh-update again, thereby attaining barrier synchronization.

[0121] Furthermore, when the thread number is 1, the process is performed by e-update3 as 'update 3 of e', thereby attaining barrier synchronization.

[0122] Afterwards, to restore the state of the interchanged pivots to the original state, 'len', 'is', and 'ie' are computed, the columns are interchanged by the subroutine 'exchange', barrier synchronization is attained, and the threads are deleted, thereby terminating the process.
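
The synchronization pattern of this routine can be sketched as follows, with Python threads standing in for the embodiment's threads and print statements standing in for the real block updates; the designated thread ('thread number 1' in the text) appears as tid 0 here.

    import threading

    N_THREADS = 4
    barrier = threading.Barrier(N_THREADS)

    def parallel_phase(tid, name):
        # placeholder for a thread's share of a parallel block update
        print("thread %d: parallel update of %s" % (tid, name))

    def serial_phase(name):
        # placeholder for a single-thread update of the block E
        print("one thread only: %s" % name)

    def worker(tid):
        for par, ser in (("A-D and G", "update 1 of e"),
                         ("D and F",   "update 2 of e"),
                         ("B and H",   "update 3 of e")):
            parallel_phase(tid, par)
            barrier.wait()
            if tid == 0:              # the designated thread
                serial_phase(ser)
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()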

[0123] FIG. 11 shows the pseudo code of the update subroutines for the blocks B and D.

[0124] In updating the block B, the subroutine b-update accesses the shared matrix a(k,n), and 'len', 'is1', and 'ie1', having the same meanings as in the explanation above, are computed. 'iof' indicates the number of the starting column of the block B. Then, using the matrix TRU-U, which is the upper triangular matrix with its diagonal elements set to 1, the block B of the array a is updated by the equation shown in FIG. 11. The expression 'is:ie' indicates the range from the matrix element 'is' to 'ie'.

[0125] In updating the block D, the subroutine d-update computes similar parameters, and updates the matrix a by the equation shown in FIG. 11 according to the lower triangular matrix TRL in the block E.
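
Hedged sketches of these two transformations are given below, with scipy's triangular solver standing in for the hand-written loops of the pseudo code. The unit diagonal of TRU-U follows the text; treating the diagonal of TRL as stored (a Crout-style factorization) is an assumption here.

    import numpy as np
    from scipy.linalg import solve_triangular

    def b_update(B, E):
        # B <- B U^{-1}, where U is the upper triangular part of E
        # taken with unit diagonal (TRU-U in the text). Solving
        # X U = B is done as U^T X^T = B^T.
        return solve_triangular(np.triu(E).T, B.T,
                                lower=True, unit_diagonal=True).T

    def d_update(D, E):
        # D <- L^{-1} D, where L is the lower triangular part of E
        # (TRL in the text), diagonal used as stored.
        return solve_triangular(np.tril(E), D, lower=True)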

[0126] FIG. 12 shows the pseudo code of the update subroutines for the blocks C and A.

[0127] In the update subroutine c-update for the block C, the block C is updated by the multiplication of the blocks B and F. a(1:iof, is2:ie2) indicates the block C, a(1:iof, iof+1:iof+blk) indicates the block B, and a(iof+1:iof+blk, is2:ie2) indicates the block F.

[0128] In the update subroutine a-update for the block A, the block A is updated using the blocks B and D. a(1:iof, is2:ie2) indicates the block A, a(1:iof, iof+1:iof+blk) indicates the block B, and a(iof+1:iof+blk, is2:ie2) indicates the block D.

[0129] FIG. 13 shows the pseudo code of the update subroutine for the block G and the first and second update subroutines for the block E.

[0130] In the update subroutine g-update for the block G, as in the above mentioned subroutines, 'len', 'is2', 'ie2', 'iof', etc., indicating the width, the starting position, the ending position, etc. of a block, are computed, and the block G is updated using the blocks D and H. a(iof+1:n, is2:ie2) indicates the block G, a(iof+1:n, iof+1:iof+blk) indicates the block H, and a(iof+1:iof+blk, is2:ie2) indicates the block D.

[0131] In the first update subroutine e-update1 of the block E, the triangular matrix above the diagonal elements of E is updated using the column vector s(1:i, i) before the diagonal element and the row vector s(i, i+1:blk) after the diagonal element.

[0132] In the second update subroutine e-update2 of the block E, the diagonal element of the upper triangular matrix of the block E is updated to the value obtained by dividing the element value before the update by the diagonal element value; the block is updated using the row vector s(i, 1:i−1) before the diagonal element and the column vector s(i+1:blk, i) after the diagonal element; the elements of the lower triangular matrix of the block E are updated to the values obtained by dividing them by the diagonal element with its sign inverted; and the diagonal element is updated to the reciprocal of the diagonal element of the block E.

[0133] FIG. 14 shows the pseudo code of the final update subroutine for the block E and the update subroutine for the blocks D and F.

[0134] In the final update subroutine e-update3 of the block E, the upper triangular matrix of the block E is updated using the column vector s(1:i−1, i) before the diagonal element and the row vector s(i, 1:i−1), and the elements before the diagonal element of the block E are updated by multiplying them by the diagonal element before the update.

[0135] In the update subroutine df-update for the blocks D and F, if the width 'len' of a block is smaller than 10, the block D or F (depending on the arguments 'is' and 'ie' of the subroutine) is updated using the element values s(1:i−1, i) and its own row vector a(i, is:ie). The element values of the block D or F are expressed by a(1:i−1, is:ie); when the subroutine reads a matrix element, the read position is offset by the above mentioned nbase, so that the block D or F is identified by computing a column number for the element values 1 to i−1. When 'len' is 20 or larger and 32 or smaller, len1 and len2 are defined, df-update is recursively called, the process shown in FIG. 14 is performed, and df-update is called again, thereby terminating the process.

[0136] FIG. 15 shows the pseudo code of the update subroutine for the blocks B and H.

[0137] In FIG. 15, bh-update performs the updating process by the operation shown in FIG. 15 when 'len' is smaller than 10. If 'len' is 20 or larger and 32 or smaller, len1 and len2 are defined accordingly; otherwise, len1 and len2 are defined separately. Then bh-update is called, the operation is performed by the equation shown in FIG. 15, and bh-update is called again, thereby terminating the process.

[0138] According to an embodiment of the present invention, compared with the corresponding function of SUN's numeric operation library (the SUN Performance Library, which obtains an inverse matrix after the LU decomposition by another method), the process can be performed 6.6 times faster using 7 CPUs.

[0139] Refer to the following textbook for the common algorithms of matrix computation.

[0140] G. H. Golub and C. F. Van Loan, "Matrix Computations", The Johns Hopkins University Press, third edition, 1996.

[0141] According to the present invention, a method of computing an inverse matrix can be realized with high performance and scalability.

What is claimed is:
 1. A parallel processing method for an inverse matrix for a shared memory type scalar parallel computer, comprising: specifying a predetermined square block in a matrix for which an inverse matrix is to be obtained; decomposing the matrix into upper left, left side, lower left, upper, lower, upper right, right side, and lower right blocks surrounding the square block positioned in the center; dividing each of the decomposed blocks by the number of processors and LU decomposing the square block and the lower, right side, and lower right blocks in parallel; updating the left side, upper, lower, and right side blocks in parallel in a recursive program, and further updating in parallel the upper left, lower left, upper right, and lower right blocks using the blocks updated in the recursive program; updating a predetermined square block in plural stages using one processor; and setting the position of the square block such that it sequentially moves on the diagonal line of the matrix, and obtaining an inverse matrix of the matrix by repeating the above mentioned steps.
 2. The method according to claim 1, wherein said shared memory type scalar parallel computer comprises a plurality of processors, plural units of cache memory provided for the respective processors, plural units of shared memory, and an interconnection network connecting the units such that they can communicate.
 3. The method according to claim 1, wherein said method is used to realize a Gauss-Jordan method by parallel computation for each block.
 4. The method according to claim 1, wherein a width of a division used when each block is divided for parallel computation is set, from a size of the matrix from which an inverse matrix is obtained and from a number of processors available in a parallel process, such that a total amount of computation of the square blocks not processed in parallel is about 1% of the entire computation.
 5. A program for realizing a parallel processing method for an inverse matrix for a shared memory type scalar parallel computer, comprising: specifying a predetermined square block in a matrix for which an inverse matrix is to be obtained; decomposing the matrix into upper left, left side, lower left, upper, lower, upper right, right side, and lower right blocks surrounding the square block positioned in the center; dividing each of the decomposed blocks by the number of processors and LU decomposing the square block and the lower, right side, and lower right blocks in parallel; updating the left side, upper, lower, and right side blocks in parallel in a recursive program, and further updating in parallel the upper left, lower left, upper right, and lower right blocks using the blocks updated in the recursive program; updating a predetermined square block in plural stages using one processor; and setting the position of the square block such that it sequentially moves on the diagonal line of the matrix, and obtaining an inverse matrix of the matrix by repeating the above mentioned steps.
 6. The program according to claim 5, wherein said shared memory type scalar parallel computer comprises a plurality of processors, plural units of cache memory provided for the respective processors, plural units of shared memory, and an interconnection network connecting the units such that they can communicate.
 7. The program according to claim 5, wherein said method is used to realize a Gauss-Jordan method by parallel computation for each block.
 8. The program according to claim 5, wherein a width of a division used when each block is divided for parallel computation is set, from a size of the matrix from which an inverse matrix is obtained and from a number of processors available in a parallel process, such that a total amount of computation of the square blocks not processed in parallel is about 1% of the entire computation.